feat: Add optional lz4 compression support for arrays passed via `base64` or `binref` encoding by angela-ko · Pull Request #579 · pasteurlabs/tesseract-core

angela-ko · 2026-04-30T17:53:54Z

Relevant issue or PR

To be done prior to implementing in pasteur-types

The changes are essentially identical to those in the pasteur-types PR:
https://github.com/pasteurlabs/pasteur-types/pull/358/changes

Following the design doc linked below, I chose to start with lz4 as the minimal-dependency option for compression. More optional compression types can be added once this is working.
https://pasteurisi.atlassian.net/wiki/spaces/~71202060d9f9d7be6c427dafac7d77e930e293/pages/1191247903/Compression+-+Design+Options

Description of changes

Add lz4 as an optional dependency
Add compress/decompress to array_encoding and output_to_bytes
Update the CLI and tesseract.py to support compression as well
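
For illustration, a minimal sketch of what optional compression on the base64 path could look like. The names `encode_buffer`, `decode_buffer`, and `CODECS`, as well as the payload layout, are hypothetical and not taken from the PR. `lz4.frame.compress` and `lz4.frame.decompress` are the real lz4 calls; stdlib `zlib` is registered only as a stand-in so the sketch runs when lz4 is not installed.

```python
from __future__ import annotations

import base64
import zlib
from array import array

# Codec registry: name -> (compress, decompress).
# zlib is a stdlib stand-in (assumption) so the sketch runs without lz4.
CODECS = {"zlib": (zlib.compress, zlib.decompress)}
try:
    import lz4.frame  # optional dependency

    CODECS["lz4"] = (lz4.frame.compress, lz4.frame.decompress)
except ImportError:
    pass  # lz4 not installed; only the stand-in codec is available


def encode_buffer(raw: bytes, compression: str | None = None) -> dict:
    """Base64-encode raw array bytes, optionally compressing them first."""
    if compression is not None:
        compress, _ = CODECS[compression]
        raw = compress(raw)
    return {
        "buffer": base64.b64encode(raw).decode("ascii"),
        "compression": compression,  # recorded so the reader knows how to decode
    }


def decode_buffer(payload: dict) -> bytes:
    """Invert encode_buffer: base64-decode, then decompress if needed."""
    raw = base64.b64decode(payload["buffer"])
    if payload.get("compression") is not None:
        _, decompress = CODECS[payload["compression"]]
        raw = decompress(raw)
    return raw


# 1000 float64 zeros, as raw bytes (highly compressible on purpose).
data = array("d", [0.0] * 1000).tobytes()
codec = "lz4" if "lz4" in CODECS else "zlib"
plain = encode_buffer(data)
packed = encode_buffer(data, compression=codec)
```

Compression happens before base64 encoding, so the base64 overhead applies to the already-compressed bytes.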

Testing done

Unit testing

angela-ko · 2026-04-30T18:01:39Z

@dionhaefner @nmheim Is this what you meant by testing compression in tesseract?

dionhaefner · 2026-05-01T06:55:51Z

That's a good start, thanks @angela-ko! As a next step, please add minimal, meaningful end-to-end tests that cover this functionality. I expect them to fail, because I do see some issues with how the new lz4 dependency is added :)

Once everything is passing end-to-end I'll have a closer look at the design choices here.

dionhaefner · 2026-05-01T06:56:30Z

And please outline your rationale for choosing lz4 specifically as part of the PR body.

codecov · 2026-05-11T02:35:04Z

Codecov Report

❌ Patch coverage is 75.75758% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.84%. Comparing base (0e08e21) to head (1f7ac9d).

Files with missing lines	Patch %	Lines
tesseract_core/runtime/array_encoding.py	76.31%	7 Missing and 2 partials ⚠️
tesseract_core/sdk/tesseract.py	68.42%	3 Missing and 3 partials ⚠️
tesseract_core/runtime/cli.py	83.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #579      +/-   ##
==========================================
+ Coverage   68.30%   77.84%   +9.54%     
==========================================
  Files          39       39              
  Lines        4635     4690      +55     
  Branches      754      770      +16     
==========================================
+ Hits         3166     3651     +485     
+ Misses       1224      727     -497     
- Partials      245      312      +67

PasteurBot · 2026-05-11T02:38:43Z

Benchmark Results

ℹ️ No baseline found — all benchmarks marked as new.

Benchmarks use a no-op Tesseract to measure pure framework overhead.

Benchmark	Baseline	Current	Change	Status
`api/apply_1,000`	-	0.581ms	new	🆕
`api/apply_100,000`	-	0.583ms	new	🆕
`api/apply_10,000,000`	-	0.581ms	new	🆕
`cli/apply_1,000`	-	1704.817ms	new	🆕
`cli/apply_100,000`	-	1830.680ms	new	🆕
`cli/apply_10,000,000`	-	2017.813ms	new	🆕
`decoding/base64_1,000`	-	0.037ms	new	🆕
`decoding/base64_100,000`	-	0.531ms	new	🆕
`decoding/base64_10,000,000`	-	67.930ms	new	🆕
`decoding/base64+lz4_1,000`	-	0.040ms	new	🆕
`decoding/base64+lz4_100,000`	-	0.572ms	new	🆕
`decoding/base64+lz4_10,000,000`	-	115.237ms	new	🆕
`decoding/binref_1,000`	-	0.203ms	new	🆕
`decoding/binref_100,000`	-	0.240ms	new	🆕
`decoding/binref_10,000,000`	-	11.089ms	new	🆕
`decoding/binref+lz4_1,000`	-	0.210ms	new	🆕
`decoding/binref+lz4_100,000`	-	0.290ms	new	🆕
`decoding/binref+lz4_10,000,000`	-	40.529ms	new	🆕
`decoding/json_1,000`	-	0.107ms	new	🆕
`decoding/json_100,000`	-	9.102ms	new	🆕
`decoding/json_10,000,000`	-	1077.563ms	new	🆕
`encoding/base64_1,000`	-	0.042ms	new	🆕
`encoding/base64_100,000`	-	0.149ms	new	🆕
`encoding/base64_10,000,000`	-	29.679ms	new	🆕
`encoding/base64+lz4_1,000`	-	0.048ms	new	🆕
`encoding/base64+lz4_100,000`	-	0.348ms	new	🆕
`encoding/base64+lz4_10,000,000`	-	93.115ms	new	🆕
`encoding/binref_1,000`	-	0.316ms	new	🆕
`encoding/binref_100,000`	-	0.491ms	new	🆕
`encoding/binref_10,000,000`	-	20.579ms	new	🆕
`encoding/binref+lz4_1,000`	-	0.325ms	new	🆕
`encoding/binref+lz4_100,000`	-	0.705ms	new	🆕
`encoding/binref+lz4_10,000,000`	-	84.605ms	new	🆕
`encoding/json_1,000`	-	0.152ms	new	🆕
`encoding/json_100,000`	-	13.522ms	new	🆕
`encoding/json_10,000,000`	-	1417.693ms	new	🆕
`http/apply_1,000`	-	3.121ms	new	🆕
`http/apply_100,000`	-	9.025ms	new	🆕
`http/apply_10,000,000`	-	788.311ms	new	🆕
`roundtrip/base64_1,000`	-	0.088ms	new	🆕
`roundtrip/base64_100,000`	-	0.696ms	new	🆕
`roundtrip/base64_10,000,000`	-	94.201ms	new	🆕
`roundtrip/base64+lz4_1,000`	-	0.099ms	new	🆕
`roundtrip/base64+lz4_100,000`	-	0.938ms	new	🆕
`roundtrip/base64+lz4_10,000,000`	-	211.874ms	new	🆕
`roundtrip/binref_1,000`	-	0.539ms	new	🆕
`roundtrip/binref_100,000`	-	0.739ms	new	🆕
`roundtrip/binref_10,000,000`	-	32.002ms	new	🆕
`roundtrip/binref+lz4_1,000`	-	0.553ms	new	🆕
`roundtrip/binref+lz4_100,000`	-	1.008ms	new	🆕
`roundtrip/binref+lz4_10,000,000`	-	129.190ms	new	🆕
`roundtrip/json_1,000`	-	0.272ms	new	🆕
`roundtrip/json_100,000`	-	20.142ms	new	🆕
`roundtrip/json_10,000,000`	-	2476.531ms	new	🆕

Benchmark details

Runner: Linux 6.17.0-1018-azure x86_64
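
To put the lz4 rows in context, the relative time cost can be read off the table directly. The short sketch below uses only the 10,000,000-element roundtrip numbers reported above; the helper name `slowdown` is made up for illustration. It measures time only and says nothing about payload size or transfer cost, which the benchmark does not report.

```python
# Roundtrip timings (ms) for 10,000,000 elements, copied from the table above.
roundtrip_ms = {
    "base64": 94.201,
    "base64+lz4": 211.874,
    "binref": 32.002,
    "binref+lz4": 129.190,
}


def slowdown(encoding: str) -> float:
    """Ratio of the lz4 roundtrip time to the uncompressed roundtrip time."""
    return roundtrip_ms[f"{encoding}+lz4"] / roundtrip_ms[encoding]


for enc in ("base64", "binref"):
    print(f"{enc}: {slowdown(enc):.2f}x slower with lz4")
```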

dionhaefner

Taking shape – let's get some clarity on high-level design decisions before diving into details.

…ssion sie if compression is set

…onal

dionhaefner · 2026-06-29T13:44:56Z

+### binref + lz4 compression
+
+Set `TESSERACT_BINREF_COMPRESSION=lz4` to compress arrays in `.bin` files. Each array is compressed individually, preserving offset-based random access. The compressed size is embedded directly in the buffer path (`<file>:<offset>:<compressed_size>`).


This now also applies to base64, correct?
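
For readers following along, a minimal sketch of the per-array scheme the doc text describes: each array is compressed on its own and located through a `<file>:<offset>:<compressed_size>` buffer path. The helpers `append_array` and `read_array` are hypothetical and not the PR's implementation; stdlib `zlib` is used as a stand-in only when lz4 is not importable.

```python
from __future__ import annotations

import os
import tempfile
import zlib

try:
    import lz4.frame

    compress, decompress = lz4.frame.compress, lz4.frame.decompress
except ImportError:
    # Stand-in codec (assumption) so the sketch runs without lz4 installed.
    compress, decompress = zlib.compress, zlib.decompress


def append_array(bin_path: str, raw: bytes) -> str:
    """Compress one array's bytes, append them to the .bin file,
    and return a buffer path of the form <file>:<offset>:<compressed_size>."""
    blob = compress(raw)
    with open(bin_path, "ab") as f:
        f.seek(0, os.SEEK_END)  # make the append position explicit
        offset = f.tell()
        f.write(blob)
    return f"{os.path.basename(bin_path)}:{offset}:{len(blob)}"


def read_array(base_dir: str, buffer_path: str) -> bytes:
    """Seek straight to one array and decompress only its bytes."""
    name, offset, size = buffer_path.rsplit(":", 2)
    with open(os.path.join(base_dir, name), "rb") as f:
        f.seek(int(offset))
        return decompress(f.read(int(size)))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.bin")
    first = append_array(path, b"\x00" * 4096)
    second = append_array(path, b"\x01" * 4096)
    # Random access: the second array is read without decompressing the first.
    restored = read_array(tmp, second)
```

Because the compressed size is part of the buffer path, the reader knows exactly how many bytes to read at the offset, which is what keeps offset-based access working after compression.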

dionhaefner · 2026-06-29T13:45:16Z

+def _lz4_frame():
+    import lz4.frame
+
+    return lz4.frame


Can live in global scope since the dep is now mandatory

dionhaefner · 2026-06-29T13:49:19Z

    output_path: str = "."
    output_format: supported_format_type = "json"
    output_file: str = ""
+    binref_compression: Literal["lz4"] | None = None


Only binref?

dionhaefner · 2026-06-29T13:49:55Z

    if array_encoding == "base64":
-        return _dump_base64_arraydict(arr)
+        return _dump_base64_arraydict(
+            arr, compression=context.get("base64_compression")


I suggest we use a single use_compression variable instead of format-specific ones.
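
A rough sketch of what that suggestion might look like: one setting read from the context and passed to whichever encoder is selected. Everything here is hypothetical (the `use_compression` context key, `dump_array`, and the stub encoders), and the stubs only record the chosen setting instead of encoding anything.

```python
from __future__ import annotations

from typing import Any, Literal, Optional

Compression = Optional[Literal["lz4"]]


def _dump_base64(raw: bytes, compression: Compression) -> dict[str, Any]:
    # Stub encoder: records the setting it would apply.
    return {"encoding": "base64", "compression": compression}


def _dump_binref(raw: bytes, compression: Compression) -> dict[str, Any]:
    # Stub encoder: records the setting it would apply.
    return {"encoding": "binref", "compression": compression}


def dump_array(raw: bytes, array_encoding: str, context: dict[str, Any]) -> dict[str, Any]:
    """Dispatch to an encoder, reading one shared compression setting."""
    compression = context.get("use_compression")  # single key for every encoding
    if array_encoding == "base64":
        return _dump_base64(raw, compression=compression)
    if array_encoding == "binref":
        return _dump_binref(raw, compression=compression)
    raise ValueError(f"unsupported array encoding: {array_encoding!r}")


ctx = {"use_compression": "lz4"}
as_base64 = dump_array(b"\x00" * 8, "base64", ctx)
as_binref = dump_array(b"\x00" * 8, "binref", ctx)
uncompressed = dump_array(b"\x00" * 8, "base64", {})
```

With a single key, adding another encoding does not require a new format-specific option.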

dionhaefner

Thanks @angela-ko. Looking real good now, just a last few comments.

angela-ko marked this pull request as ready for review April 30, 2026 18:01

angela-ko requested review from apaleyes, dionhaefner, jpbrodrick89 and xalelax as code owners April 30, 2026 18:01

angela-ko marked this pull request as draft May 11, 2026 02:31

angela-ko force-pushed the ako/compression branch 2 times, most recently from 56294af to 17fb949 Compare May 11, 2026 18:21

angela-ko force-pushed the ako/compression branch 3 times, most recently from 8310d77 to dc6a43b Compare May 25, 2026 04:58

angela-ko marked this pull request as ready for review May 25, 2026 04:58